topic in the course was to be multiple regression, but Ardie discovered
that much of the class did not have sufficient background to begin this
preliminary topic. Thus, the course consisted of an overview of statistics
a la Lubin and a thorough coverage of linear regression. Unfortunately,
Ardie was admitted to the hospital just as he was ready to begin talking
about multiple regression, and the class was never able to get started
again.

Ardie produced a series of memos for those first sessions which should
be shared with all who can appreciate his giftedness and understanding in
the field of statistics. Those memos are the topic of this technical
report.

Ardie was in considerable pain when he wrote these memos, which caused
him to make several mistakes. I have attempted to correct these errors,
and I am to blame if I have not done so adequately. However, the words
you are reading are basically Ardie's. To improve the flow of the memo,
some of his comments were put as footnotes and I have added some
clarification notes of my own.

Ardie promised, at the beginning of one memo, to detail some diagnostic
checks. At the end of that memo, the same checks were promised in the next
memo. Unfortunately, the "next memo" was never written as Ardie's death
occurred on October 9, 1976. Since Ardie talked about the checks in class,
I have included them in my own words at the end, along with Ardie's
"statement of ethics."


(1975) on pages 16-19 and 242-3. We will start off with some of the many
ways of defining the product-moment correlation and then give a short
history to show how correlation and regression came to be linked.
Specifically, the relation of the product-moment correlation to the slope
coefficient of the usual linear least-squares model will be given. We
will point out that what seem to be very slight changes in the
specifications of the linear model and the assumptions can completely alter
that relation.

DEFINITIONS 

The  definitions  and  equations  presented  here  are  generally  drawn 
from  Kendall  and  Buckland's  A Dictionary  of  Statistical  Terms  (1971). 

Computation of a product-moment correlation requires two variables,
usually denoted X and Y. These variables are quantitative, i.e., at least
graded. Many texts call X the independent variable and Y the dependent
variable. Kendall and Buckland (1971) point out, however, that this
usage is completely incompatible with the standard definition of statistical
independence (see p. 71). Thus, psychometricians use the terms predictor
and criterion for X and Y, respectively. These terms avoid prejudging the
factual issue of dependence.

* Editor's Note: Taken from Ardie's memo dated February 17, 1976.
The linear regression model for a population can be written:

(1)  $Y_i = \gamma_0 + \gamma_1 X_i + e_i$ *

where

Y_i  is the criterion score for the ith subject
X_i  is the predictor score for the ith subject
γ_0  is the intercept, the value of Y_i when X_i = 0
γ_1  is the slope of Y on X, the change in Y when
     X increases by one unit
e_i  refers to any error attributable to the ith subject

When we compute our values from some finite sample of the population,
then the linear regression model is:

(2)  $Y_i = w_0 + w_1 X_i + e_i$

where w_0 and w_1 refer to the sample values of the intercept and slope,
respectively.

The objective of the linear regression procedure is to find Ŷ_i, a
predicted criterion score for the ith subject, where Ŷ_i is equal to the
observed score (Y_i) minus the error for that subject (e_i). Ŷ_i is the
least squares predicted value, that is, it minimizes the squared errors
summed over all subjects.**

(a)  Population values are usually given Greek letters, while sample values
are given Latin letters.

* Editor's Note: A linear equation generally takes the form Y = a + bX,
where a refers to the intercept and b to the slope.

** Editor's Note: The discussion of how this is minimized is postponed
until a later section.
Let x_i be the deviation of X_i from the mean of the sample (i.e.,
x_i = X_i - X̄), or the deviate score for short, and let y_i be the deviate
score for Y (i.e., y_i = Y_i - Ȳ); then the deviance of Y (dev y) is equal
to the sum of the squared deviates of Y about the sample mean:

(3)  $\mathrm{dev}\ y = \sum y_i^2 = \sum (Y_i - \bar{Y})^2$

and

(4)  $\mathrm{dev}\ x = \sum x_i^2 = \sum (X_i - \bar{X})^2$

Most texts talk about the "sum of squares" instead of deviance. I prefer
deviance because it is less ambiguous (i.e., it cannot be confused with
ΣX², the sum of the squared raw scores) and less clumsy.* The usual
calculation for dev x is:

(5)  $\mathrm{dev}\ x = \sum X_i^2 - (\sum X_i)^2 / N$

Using this notation, the sample variance of X is given by:

(6)  $\mathrm{variance}\ x = \mathrm{dev}\ x / (N - 1) = s_x^2$

and the standard deviation is the square root of the variance.

Another useful term is that of codeviance. The codeviance of X and Y
(i.e., codev xy) is the sum of the cross-products of x_i and y_i, the deviate
scores. That is:

(7)  $\mathrm{codev}\ xy = \sum x_i y_i = \sum (X_i - \bar{X})(Y_i - \bar{Y})$

All summations will run from 1 to N, the sample size, unless otherwise
noted.

Editor's  Note:  I always  used  the  term  "sum  of  squares"  until  I was 
indoctrinated  by  Ardie.  The  term  deviance  is  much  less  confusing. 
Most texts use the term covariance, where:

(8)  $\mathrm{covariance}\ xy = \mathrm{codev}\ xy / (N - 1)$
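
The deviance, codeviance, variance, and covariance defined in equations 3
through 8 can be sketched in a few lines of Python. This is only an
illustration: the function names and the five pairs of scores are
hypothetical and do not come from the memo.

```python
def deviance(scores):
    """Sum of squared deviate scores about the sample mean (equations 3, 4)."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores)


def deviance_from_raw(scores):
    """Computational form (equation 5): sum of X^2 minus (sum of X)^2 / N."""
    n = len(scores)
    return sum(s * s for s in scores) - sum(scores) ** 2 / n


def codeviance(xs, ys):
    """Sum of cross-products of the deviate scores (equation 7)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


def variance(scores):
    """Sample variance (equation 6): deviance divided by N - 1."""
    return deviance(scores) / (len(scores) - 1)


def covariance(xs, ys):
    """Sample covariance (equation 8): codeviance divided by N - 1."""
    return codeviance(xs, ys) / (len(xs) - 1)


# Hypothetical scores, chosen only to keep the arithmetic easy to follow.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 1.0, 4.0, 3.0, 5.0]

print(deviance(X), deviance_from_raw(X))  # two routes to dev x
print(codeviance(X, Y), variance(X), covariance(X, Y))
```

Computing dev x both from the deviate scores and from the raw scores is a
small instance of the two-way check that the memo recommends later.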

Before we discuss the Pearson product-moment correlation and the
slope, one other useful concept should be defined, the unit deviate. Unit
deviate scores have a mean of zero and a standard deviation of one. They
are found by dividing each deviate score in the sample by the sample standard
deviation. In the remaining discussion, let u_i (u_i = x_i/s_x) be the unit
deviate score for X and let v_i (v_i = y_i/s_y) be the unit deviate score
for Y.

The Pearson product-moment correlation is denoted by r_xy for the
sample and by ρ_xy (pronounced "row") for the population. It must be a
number between +1 and -1. The product-moment correlation can be defined in a
number of ways, utilizing the various concepts presented previously. In
terms of unit deviates:

(9)  $r_{xy} = (\sum u_i v_i) / (N - 1)$

when u_i = v_i, then r_xy = +1; and when u_i = -v_i, then r_xy = -1.

In terms of codeviance:

(10)  $r_{xy} = \mathrm{codev}\ xy / (\sqrt{\mathrm{dev}\ x}\ \sqrt{\mathrm{dev}\ y})$


* Editor's Note: A unit deviate score is most commonly referred to as
a z-score or a standard score. The reader may find this transformation
to be somewhat cumbersome in discussing linear regression. However, the
unit deviate transformation is very useful in multivariate statistics,
including multiple regression. In multivariate statistics, a set of
scores can be represented by a vector. Subtracting the sample mean
from each score, i.e., finding deviate scores, [...]
the origin of the space in which it is defined. Dividing each score
by the sample standard deviation greatly simplifies the computation
of the vector's length.
or covariance:

(11)  $r_{xy} = \mathrm{covar}\ xy / (\sqrt{\mathrm{var}\ x}\ \sqrt{\mathrm{var}\ y})$

(12)  $r_{xy} = \mathrm{covar}\ xy / (s_x s_y)$

where s_x and s_y are sample standard deviations.

One additional way of defining the product-moment correlation is
very useful for those who, like me, require that a statistic be checked by
doing it two different ways. (Remember that redundant information is the
only information worth having.) Let d_i = X_i - Y_i and s_d² be its variance.
Then

(13)  $r_{xy} = (s_x^2 + s_y^2 - s_d^2) / (2 s_x s_y)$

Equation 13 uses the same standard deviations in the denominator as equation
12, but the numerators are completely independent from a computing point of
view.
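
The alternative definitions of the correlation can be checked against one
another numerically. The sketch below computes r from unit deviates
(equation 9), from codeviance (equation 10), and from the variance of the
differences d = X - Y. That last route is an assumed reading of equation 13,
based on the identity var(X - Y) = var X + var Y - 2 covar xy; the data and
function names are hypothetical.

```python
import math


def mean(scores):
    return sum(scores) / len(scores)


def sd(scores):
    """Sample standard deviation, using N - 1 as in equation 6."""
    m = mean(scores)
    return math.sqrt(sum((s - m) ** 2 for s in scores) / (len(scores) - 1))


def r_unit_deviates(xs, ys):
    """Equation 9: sum of products of unit deviates, divided by N - 1."""
    mx, my, sx, sy = mean(xs), mean(ys), sd(xs), sd(ys)
    u = [(x - mx) / sx for x in xs]
    v = [(y - my) / sy for y in ys]
    return sum(a * b for a, b in zip(u, v)) / (len(xs) - 1)


def r_codeviance(xs, ys):
    """Equation 10: codev xy over the root of dev x times dev y."""
    mx, my = mean(xs), mean(ys)
    codev = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    dev_x = sum((x - mx) ** 2 for x in xs)
    dev_y = sum((y - my) ** 2 for y in ys)
    return codev / math.sqrt(dev_x * dev_y)


def r_difference_check(xs, ys):
    """Assumed reading of equation 13: uses the variance of d = X - Y."""
    d = [x - y for x, y in zip(xs, ys)]
    return (sd(xs) ** 2 + sd(ys) ** 2 - sd(d) ** 2) / (2 * sd(xs) * sd(ys))


# Hypothetical scores for illustration.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 1.0, 4.0, 3.0, 5.0]

for method in (r_unit_deviates, r_codeviance, r_difference_check):
    print(method.__name__, round(method(X, Y), 6))
```

The difference route never forms a cross-product, which is what makes it
useful as an independent check on the other two.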

Finally, let me make some points about the relation of r_xy to w_1,
the least square slope of Y on X in the sample.

(14)  $w_1 = r_{xy} (s_y / s_x)$

or, by an algebraic derivation,

(15)  $w_1 = \mathrm{codev}\ xy / \mathrm{dev}\ x$

Suppose we tried to compute the linear regression weights of equation
2 for u_i and v_i, i.e., suppose we tried to get the least-square prediction
of v_i, the unit deviate y score, from u_i, the unit deviate x score. It is
easy to show that w_0 = 0 and w_1 = r_xy. Thus

(16)  $\hat{v}_i = r_{xy} u_i$  and  $\hat{u}_i = r_{xy} v_i$

Thus, the product-moment correlation is also the least-squares regression
slope when both variables are in the form of unit deviate scores.

From equation 14, we can see that the equality of the correlation
with the slope will always occur when s_x = s_y. More fundamentally, r_xy is
itself a symmetrical concept. The correlation of X with Y is identical to
the correlation of Y with X, even when the slope of Y on X differs from
the slope of X on Y.

Thus, correlation tends to be used descriptively when we do not
distinguish between criterion and predictor or when there is no such
distinction. When we can differentiate clearly between the variable we
wish to predict and the variable predicted from, the slope tends to be
used as a descriptive statistic. Of course, the two statistics are
irrevocably linked in that w_1 must be zero if r_xy is zero.

Finally, given the sample slope (w_1) and the mean of the criterion
and predictor scores (i.e., Ȳ and X̄, respectively), then it is easy to
calculate the intercept, w_0:

(17)  $w_0 = \bar{Y} - w_1 \bar{X}$
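
The relations among the slope, the intercept, and the correlation
(equations 14, 15, and 17) can be verified on a small example. The sketch
below is illustrative only; the scores and function names are hypothetical.
It also shows that the slope of Y on X and the slope of X on Y differ even
though the correlation is symmetrical.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope for predicting ys from xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    codev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    w1 = codev / dev_x            # equation 15
    w0 = mean_y - w1 * mean_x     # equation 17
    return w0, w1


def correlation(xs, ys):
    """Product-moment correlation from codeviance (equation 10)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    codev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return codev / math.sqrt(dev_x * dev_y)


def sd(scores):
    m = sum(scores) / len(scores)
    return math.sqrt(sum((s - m) ** 2 for s in scores) / (len(scores) - 1))


# Hypothetical scores with unequal standard deviations.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [3.0, 2.0, 9.0, 6.0, 10.0]

w0, w1 = fit_line(X, Y)           # Y predicted from X
_, w1_reverse = fit_line(Y, X)    # X predicted from Y
r = correlation(X, Y)

print("w0 =", w0, "w1 =", w1)
print("r * (s_y / s_x) =", r * sd(Y) / sd(X))      # equation 14
print("slope of X on Y =", w1_reverse)
print("product of slopes =", w1 * w1_reverse, "r squared =", r * r)
```

Because the two slopes are r(s_y/s_x) and r(s_x/s_y), their product is r
squared, and they coincide with r only when the standard deviations are
equal.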

HISTORICAL INTRODUCTION (c)

[A]  The  Two  Gaussian  Models 

As is very common in the history of statistics, we can start by discussing
the work of the "Prince of Mathematics," C. F. Gauss. Although Gauss was not
the first to consider the problem of predicting a criterion from a linear
function of another variable, his formulation of the problem was used for
at least 100 years. Gauss first applied his "Method of Least Squares" to
the problem of predicting the orbits of planets. He had to reconcile
observations made by different astronomers over several centuries using
very different techniques of measurement.

(c)  Much of this historical material and discussion is taken from
Dudycha & Dudycha (1972) and Binder (1959).

In his first formulation of the problem, Gauss used what we now call
the "Method of Maximum Likelihood." Roughly speaking, this means that the
predicted value, Ŷ_i, is chosen so as to be the most likely value for all
subjects having the same X score. In order to estimate this modal value of
a set of observations, Gauss needed to know the theoretical form of this
distribution of the errors of prediction. He took this to be the so-called
Normal distribution.(d) Under the assumption that the errors of prediction
were distributed normally, the method of least squares was identical to the
maximum likelihood method and gave estimates of the regression coefficients
with all kinds of optimal properties:

(1)  The regression estimates were unbiased. That is, if we
     average w_1 over all possible samples, that average will
     equal γ_1, the population value.

(2)  The w_1 estimates have minimum sampling variance. If we
     take all the possible sample values of w_1 and calculate
     their variance (i.e., expected mean square) about their
     average, γ_1, then this variance is equal to or less than
     the sampling variance of any other set of slopes estimated
     by any other method.

(3)  The estimates are distributed normally about γ_1, making exact
     tests of significance possible.

(d)  Some statisticians now prefer to call it the Gaussian distribution in
view of the contributions made by Gauss to its definition and properties.

The PLINC Model: The full set of assumptions used by Gauss in this
attempt is rather formidable. The first is the most obscure and commonly
overlooked:

P   All measures of X are perfect with no errors
    of measurement.

Mathematicians and statisticians usually say that the X values must be
mathematical values, whereas the Y values can be expressions of a random
variable (i.e., Y_i can possess a stochastic element). Psychometricians, on
the other hand, talk about the Xs being infallible measures of the underlying
trait. When X is fallible (i.e., has an error of measurement), then the usual
least squares estimate of the slope is biased toward zero.
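
The bias toward zero under a fallible X can be seen in a small simulation.
The sketch below is not from the memo: the true slope of 3, the unit error
variances, the sample size, and the seed are all assumed values chosen for
illustration. Under these assumptions, the classical errors-in-variables
result says the slope should shrink by the factor var(X) / (var(X) +
var(error)), here one half.

```python
import random


def slope(xs, ys):
    """Least-squares slope of ys on xs: codev xy / dev x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    codev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    return codev / dev_x


rng = random.Random(1976)   # fixed seed so the run is repeatable
N = 20000                   # assumed sample size
TRUE_SLOPE = 3.0            # assumed population slope

# Perfectly measured predictor and a criterion with independent normal errors.
true_x = [rng.gauss(0.0, 1.0) for _ in range(N)]
y = [2.0 + TRUE_SLOPE * x + rng.gauss(0.0, 1.0) for x in true_x]

# The same predictor observed with an independent error of measurement.
fallible_x = [x + rng.gauss(0.0, 1.0) for x in true_x]

slope_perfect = slope(true_x, y)
slope_fallible = slope(fallible_x, y)

print("slope with perfect X :", round(slope_perfect, 3))
print("slope with fallible X:", round(slope_fallible, 3))
```

With equal true-score and error variances, the fallible slope settles near
half the true slope, which is the attenuation the P assumption rules out.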

The second assumption is the primary assumption of the regression
procedure:

L   For each value of X, the mean of the corresponding Y
    values is a linear transform of X, i.e., Equation 1
    holds in the population.

Next is a trio of assumptions that is always made when a t-ratio or
F-ratio test is performed:(e)

I   The errors of prediction (i.e., the residuals in the model)
    are independent of one another in the population.

N   The errors of prediction are normally distributed about
    a true value of zero.

C   No matter which value of X is used to predict Y, the
    variance of the errors of prediction is finite and
    constant. That is, Y is homoscedastic.

(e)  Neither test existed in Gauss's day in exact small sample form.

Part of the last assumption is sometimes stated separately: the
variance of the errors of prediction is finite, never going to infinity.
Since this must be true for any finite sample of data, we will take this
for granted along with the fact that all data must be quantitative:
qualitative variables must be transformed into graded variables for the
regression model to apply.

For convenience, I will refer to equation 1 and these five assumptions
as the PLINC model. Ordinarily, reference to the linear regression model
in a statistics text means the PLINC model, even when all the assumptions
are not given explicitly.

The PLI Model: In 1809 Gauss published his method of least squares
based on the likelihood approach and these five assumptions. However,
after many years of uneasiness with this approach, he reformulated it
(1821), eliminating the assumptions that the errors of prediction were
normally distributed with constant variance, the PLI model. If the least
square equations are solved directly with no reference to these two
assumptions, one still gets the same estimators:

$w_0 = \bar{Y} - w_1 \bar{X}$  and  $w_1 = \mathrm{codev}\ xy / \mathrm{dev}\ x$

These estimators are still unbiased and have minimum sampling variance
within the class of linear, unbiased estimators.

Once the assumptions of normality and constant variance are eliminated,
the least squares estimator, w_1, has the smallest sampling variance only
among the slopes that result from using a linear unbiased estimator. Under
the full PLINC assumptions, the sampling variance of w_1 was the smallest
possible, no matter whether linear or nonlinear estimates were being
compared. Given a non-normal distribution of the errors of prediction
and/or a systematic change in their variance as a function of X, some
nonlinear unbiased estimators may have a smaller sampling variance
than w_0 and w_1.

The Central Limit Theorem states that any weighted sum of independent
variables tends toward a normal distribution as the number of weighted
observations increases indefinitely. By definition, w_1 in the PLI model
is a weighted sum of independent observations.* Thus, if enough cases are
used, even the test of significance based on the normal distribution will
hold for the ordinary least-squares estimators, w_0 and w_1, regardless of
the true distributions of X and Y.

Even the independence assumption can be modified with no damaging
consequences. Suppose that the errors of prediction are correlated
instead of independent. If the correlation is constant (i.e., if the
correlation between any pair of prediction errors is equal to the
correlation between any other pair), then the properties of the sample w_1
values are identical to those for the PLI model except that the variance
of w_1 must be decreased to take account of the constant correlation.

Constant intercorrelation is sometimes called the compound symmetry
condition or the intraclass covariance pattern.

[B]  The  Contributions  of  Galton  and  Pearson 

Gauss also generalized his findings immediately to the multiple predictor
case, and applied "multiple regression" to problems in astronomy, physics,
mathematics, etc. Multiple regression, like most good solutions, was
rediscovered independently by a number of mathematicians. In addition,
mathematicians developed many modifications of the Gauss PLINC and PLI
model. In particular, Bravais of France considered what we may call the
bivariate-normal model, where Y and X are mutually linear and both marginal
distributions are normal.

* Editor's Note: w_1 is a regression weight.

The idea of a coefficient measuring the goodness-of-fit of a straight
line to a set of points originated independently in the fertile mind of
Francis Galton (1880). Galton was very much concerned about the regression
of English intellect to the moron and idiot level through uncontrolled
breeding among the lower classes and undesired immigrants from Europe.

Galton was not a mathematician, however, and his idea of a correlation
coefficient, varying from +1 to -1 according to the degree of fit, had
little impact until Karl Pearson noticed it in Galton's Natural Inheritance
(1880). In 1895 Karl Pearson published Contributions to the Mathematical
Theory of Evolution,(f) in which he derived the large-sample properties of
the bivariate-normal correlation and generalized them to the case of multiple
correlation.

Galton used the term regression to indicate that parents with extreme
measures on height, weight, intelligence, etc. tended to have children
for whom these measures were closer to the mean. Under the PLI or PLINC
assumptions, such regression to the mean must occur when r_xy is not unity.


(f)  Both  Pearson  and  Galton  believed  in  the  panmixia  theory  of  inheritance: 
the  transmitted  elements  were  in  the  blood  and  were  continuous  in  nature, 
rather  than  discrete. 
Pearson, under the influence of Galton, became very interested in
correlation in addition to using the regression model for prediction. In
particular, he studied the bivariate case without distinguishing between
predictor and criterion. Thus, Pearson's bivariate-normal model was
symmetrical, the assumptions for X and Y being the same:

P   All measures of X and Y are perfect, with no errors
    of measurement.

L   Y is a linear function of X and vice versa.

I   Independence of the errors of prediction.

M   Marginal normality: Both X and Y are distributed normally.

C   Mutual homoscedasticity. The variance of Y is finite and
    constant for all values of X and vice versa.


Spearman's Attenuation Factor: Charles Spearman (1904) attacked Pearson
on the assumption of perfect measurement. He showed that if random errors
of measurement were added to both X and Y, their product-moment correlation
could decrease considerably by an attenuation factor. Pearson (1904) did
defend his perfect measurement assumption, but I have been unable to find
any published acknowledgement by Pearson on the effect of fallible measurement.

Some statisticians (Williams, 1959; Lindley, 1947) now speak of linear
functional relations. Williams states:

    "functional relations ... subsist between the expected
    values of different variables and will not therefore
    coincide with regression relations unless the ...
    variables are free from error ... When the ... variables
    are errorless, their observed and expected values
    coincide, so that the regression relation is the same as the
    functional relation, and both may be estimated by the
    method of least squares." (Chapter 11)
Like most statistical issues, the correct procedure to use depends on how
the question is phrased. If you want to know the correlation between two
observed variables or the best least-squares prediction using the observed
variables, then the ordinary least-squares procedure is practical and should
be used. But if you want to know the correlation between the true variables,
then you need to estimate the functional relation between them. Spearman's
"correction for attenuation" would be appropriate here, but I have never seen
a statistician use it.(g)

Reduction of assumptions. Pearson quickly generalized the bivariate
normal model to the multiple correlation case. This meant that the PLIMC
assumptions were increased manifold, depending upon the number of predictors.
Some of Pearson's co-workers felt that the multitude of assumptions required
of the correlation approach swamped its usefulness. Thus, the question of
interest became: Could the product-moment correlation be used even when
all the assumptions did not hold?

G. Udny Yule (1897) discussed the correlation concept with no assumptions
except the existence of pairs of numbers. Even under these conditions, some
properties hold:

    (1)  The correlation must be between +1 and -1.
    (2)  When the standard deviations of X and Y are equal, then the
         regression of Y on X equals the regression of X on Y,
         which equals the product-moment correlation.
    (3)  When r² = 1, all points lie on a straight line.
    (4)  As r² increases, the sum of the squared deviations from the
         straight line must decrease.
    (5)  If Y has a non-linear relation to X, then r² must be less
         than unity.
    (6)  The percent of deviance of Y that is accounted for by a
         linear transform of X is given by r².

I find these properties rather disappointing. Unless the observed r² = 1,
I cannot be sure what it means.
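
Property (6) can be checked directly: fit the least-squares line, sum the
squared residuals, and compare the proportion of dev y removed with r². The
sketch below uses hypothetical scores and function names and is meant only
as a numerical illustration of the property.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for predicting ys from xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    codev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    w1 = codev / dev_x
    return mean_y - w1 * mean_x, w1


def proportion_of_deviance_explained(xs, ys):
    """1 minus (residual deviance about the fitted line) / (dev y)."""
    w0, w1 = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    residual_dev = sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - residual_dev / dev_y


def r_squared(xs, ys):
    """Square of the product-moment correlation, from codeviance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    codev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return codev ** 2 / (dev_x * dev_y)


# Hypothetical scores for illustration.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [3.0, 2.0, 9.0, 6.0, 10.0]

print(proportion_of_deviance_explained(X, Y), r_squared(X, Y))
```

The two numbers agree for any set of pairs, which is why the property
holds with no distributional assumptions at all.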

Karl Pearson also realized that the PLIMC assumptions were very limiting.
He published (1911) a derivation of multiple correlation which simply used
least squares, noting that the computational equations were all the same
as when all the multivariate normal assumptions were used. However, he
found using the resulting regression weights and correlations for description
or inference to be very difficult without assuming at least mutual linearity
and marginal normality.

(g)  I would have thought that Karl Pearson was interested in the
inheritance of the true values rather than the observed ones,
but

[C]  R. A. Fisher and the Exact Small-Sample Approach

Just as Pearson's large-sample approach to the product-moment correlation
generated many novel concepts that revolutionized statistics, R. A. Fisher's
exact small-sample approach to the sampling distribution of the
product-moment correlation was responsible for the next great step in the
evolution of present day statistics. To see why the exact small-sample
approach shook the statistical world, let us compare the Pearson and Fisher
approach to the test of the null hypothesis that ρ = 0.

Pearson's Approach. Pearson would define the estimator of the
population parameter to be used for any finite sample. The mean and variance
then could be found without reference to the distribution from which the
sample was taken. If the population estimator was a linear function of the
sample observations, then with a large sample of independent observations
the Central Limit Theorem would show that this estimator tended to be
distributed normally about the true population mean. For a linear estimator,
the population mean and variance are weighted sums of the mean and variance,
respectively, of the sample observations. For a non-linear estimator such
as the correlation coefficient, Pearson would expand the estimator into a
Taylor series and evaluate enough terms to get a usable approximation.

Pearson was able to show with such methods that the expected value of the
sample r converged to ρ, the population value, as the sample size increased.
The variance of r was a function of only ρ and the sample size.

Pearson's usual test for ρ = 0 was to divide the sample value by the
large-sample approximation to the standard deviation. This ratio was then
referred to the normal distribution. In this case, the test was to treat

(18)  $z = r \sqrt{N - 1} / \sqrt{1 - r^2}$

as a unit normal deviate. Pearson generally used a very stringent level for
significance, z = 3 or more.

Fisher's Approach. R. A. Fisher, who was familiar with Pearson's work,
saw a discussion by Soper (1914) on the distribution of the product-moment
correlation in a bivariate normal population (i.e., with an infinite N). In
a few weeks he sketched out the exact solution and sent it to Pearson. Later,
he published a paper (1915) giving the exact distribution of the
product-moment correlation for any finite sample from a bivariate normal
population (LIMC assumptions).

Fisher's (1915) paper indicated that when ρ = 0, the expected value
(i.e., the average over all possible samples of the same finite size) of the
sample product-moment correlation is zero. In other words, when ρ = 0 in a
bivariate normal population, the Pearson product-moment correlation is
unbiased.(h) However, this property does not make the Pearson test of the
null hypothesis an accurate one with small samples.(i) Fisher recommended
using the following ratio:

(19)  $t = r \sqrt{N - 2} / \sqrt{1 - r^2}$

(h)  Many investigations using all kinds of non-normal distributions uniformly
find that the expected value of r is zero when ρ = 0. The distribution of
r is almost normal, being symmetrical about zero with diminishing
probabilities as r approaches ±1.

(i)  For Fisher and Pearson a small sample was 100 or less.
Although r is almost normally distributed under the hypothesis that ρ = 0,
the Pearson ratio (equation 18) is not normally distributed and the t-ratio
(equation 19) is not normally distributed for small samples. Fisher showed
that the t-ratio was generally greater than its corresponding unit normal
deviate, but that the difference between the two diminished to near zero
as the sample size increased to several hundred cases. A special table,
taking the sample size into account, was constructed for the t-ratio.
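
A small numerical comparison of equations 18 and 19 may help. The sketch
below computes both ratios for a hypothetical result, r = .60 with N = 10.
The two critical values are the standard two-tailed 5 percent points (1.960
for the unit normal, 2.306 for t with 8 degrees of freedom) taken from the
usual tables; they are an addition here, not something given in the memo.

```python
import math


def pearson_z(r, n):
    """Equation 18: large-sample ratio referred to the unit normal."""
    return r * math.sqrt(n - 1) / math.sqrt(1.0 - r * r)


def fisher_t(r, n):
    """Equation 19: ratio referred to the t table with N - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Standard two-tailed 5 percent critical values (assumed from tables).
NORMAL_CRITICAL = 1.960
T_CRITICAL_8_DF = 2.306

r, n = 0.60, 10   # hypothetical sample result
z = pearson_z(r, n)
t = fisher_t(r, n)

print("Pearson z =", round(z, 3), "significant:", abs(z) > NORMAL_CRITICAL)
print("Fisher t  =", round(t, 3), "significant:", abs(t) > T_CRITICAL_8_DF)
```

In this example the large-sample ratio would be called significant at the
5 percent level while the t-ratio would not, which illustrates how the
normal-theory test can overstate significance in a small sample.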

Thus, the Pearson test of the hypothesis that ρ = 0 is definitely biased
in favor of finding a significant deviation from the null hypothesis.

However, the worse case is when ρ is not zero. As ρ deviates towards +1,
the distribution of the sample product-moment correlation becomes more and
more skewed, with a long tail stretching towards -1. This causes a bias in
the expected value of r of approximately:

(20)  $-\rho (1 - \rho^2) / 2(N - 1)$

As ρ deviates towards -1, the bias changes sign.

In summary, r is an unbiased estimate in only three cases: when ρ equals
-1, 0, or +1. This is not a particularly welcome conclusion for anyone. As
a result, Fisher developed his z transformation of r,* which almost eliminates
the bias and almost normalizes the distribution. He recommended the
transformation whenever (1) testing if an observed correlation differs
significantly from a given theoretical value; (2) testing for a significant
difference between two observed correlations; and (3) in combining independent
estimates of a correlation to obtain a better one.

(j) Pearson and his coworkers were very much aware of the fact that their
statistical procedures were highly dependent on the use of large samples.
In 1947 I had the privilege of attending a class on probability calculus
by Florence Nightingale David, a former assistant of K. Pearson. In very
brief conversations, I gathered that sample sizes of less than 100 were
considered rather indecent, if not actually sinful, by workers in the
Pearson lab. In a later memo, when we consider the "bouncing beta weights"
and the shadowy semi-diaphanous suppressor variables of multiple regression,
we shall see the difficulty of justifying the use of small samples in
multiple regression.

* Editor's Note: Ardie did not present Fisher's r to z transform. It is
given by $z = 1.1513 \log_{10}[(1+r)/(1-r)]$ with a standard error equal to
$1/\sqrt{N-3}$.
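
The transformation in the Editor's Note can be sketched in a few lines of Python. This is an illustration only; the function names are hypothetical, and the constant 1.1513 is taken to be 0.5 ln 10, so the natural-log form below is the same transform. The last function shows use (1), testing an observed correlation against a theoretical value, under the usual large-sample normal approximation.

```python
import math


def fisher_z(r: float) -> float:
    """Fisher's r-to-z transform: 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def z_standard_error(n: int) -> float:
    """Standard error of z for a sample of n pairs: 1 / sqrt(n - 3)."""
    return 1.0 / math.sqrt(n - 3)


def z_test_against(r: float, rho0: float, n: int) -> float:
    """Approximate normal deviate for the hypothesis that rho equals rho0."""
    return (fisher_z(r) - fisher_z(rho0)) / z_standard_error(n)


if __name__ == "__main__":
    # Hypothetical example: r = .50 from 28 pairs, tested against rho = .30.
    print(fisher_z(0.50), z_standard_error(28), z_test_against(0.50, 0.30, 28))
```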
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Fisher's  Approach  to  the  Least-Squares  Regression  Coefficients.  The 

least-squares regression coefficients, $w_0$ and $w_1$, are unbiased estimates
of the true population values, regardless of the value of the population
correlation. Unfortunately, Fisher was unable to derive the sampling
distribution and variance of $w_0$ and $w_1$ for all combinations of N, ρ,
$\sigma_x$, $\sigma_y$ and, as far as I know, the exact small sample
distribution for all combinations of parameters has never been determined.

Like most mathematicians, Fisher changed the question to one where
he knew the answer.(k) Let us go back to equation 15 for the least-squares
slope and rewrite it in terms of deviate scores,* $x_i$ and $y_i$:

(21)  $w_1 = \mathrm{codev}\,xy / \mathrm{dev}\,x = \sum x_i y_i / \sum x_i^2$

In other words, each $y_i$ is weighted by $x_i / \sum x_i^2$. Since the
distribution of a weighted sum of independent normally-distributed variables
is well-known to be normal, and the variance of the weighted sum is equal to
the sum of the weighted variances, we are practically home free. All that
remains is to assume that the $y_i$ values (1) are independent or have the
same mutual intercorrelation; (2) are normally distributed for each fixed
value of X; (3) have constant variance; and (4) that Y is a linear function
of X. These assumptions of Fisher are frequently called the FLINC model,
which is the same as the Gaussian PLINC model except that X is fixed (F).(l)
The LINC assumptions apply only to the Y values, just as in the Gaussian model.

(k) If you can't be near the girl you love, then you love the girl you're near.

* Editor's Note: See page 3 for a discussion of deviate scores.
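
To make the "weighted sum" reading of equation 21 concrete, here is a minimal Python sketch (the function names are hypothetical). The data are the first arrangement in the Table 1 exercise, X = 0, 1, 2, 9 paired with Y = 0, 2, 5, 9.

```python
def deviates(values):
    """Return scores as deviations from their mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def slope_weights(xs):
    """Weights x_i / sum(x_i^2) that equation 21 applies to each y_i."""
    x = deviates(xs)
    dev_x = sum(xi * xi for xi in x)
    return [xi / dev_x for xi in x]


def slope(xs, ys):
    """Least-squares slope written as a weighted sum of the y deviates."""
    return sum(w * yi for w, yi in zip(slope_weights(xs), deviates(ys)))


xs = [0, 1, 2, 9]
ys = [0, 2, 5, 9]
print("weights:", slope_weights(xs))
print("slope w_1:", slope(xs, ys))
```

Because the weights depend only on the X values, holding X fixed leaves the slope as a linear combination of the Y values alone, which is what makes its sampling distribution tractable.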

The standard error of estimate(m) is a crucial statistic for tests of
significance in least-squares linear regression under Fisher's FLINC model.
Let us define this statistic in stages. As indicated by equation 2, the
observed score, $Y_i$, is equal to the predicted score, $\hat{Y}_i$ (where
$\hat{Y}_i = w_0 + w_1 X_i$), plus the error of prediction, $e_i$. If we
transform each term to unit deviate form, we get the following equation:

(22)  $y_i = w_1 x_i + e_i$ *

If we now square each term and sum over all values of i, we then have an
expression in terms of deviance (the cross-product term drops out because
the least-squares errors are uncorrelated with x):

(23)  $\sum y_i^2 = \sum \hat{y}_i^2 + \sum e_i^2$

(24)  $\mathrm{dev}\,y = \mathrm{dev}\,\hat{y} + \mathrm{dev}\,e$

That is, the deviance of the observed scores is equal to the deviance of the
predicted scores plus the error deviance.

(l) With these assumptions, we have restricted our inferences considerably by
making statements of significance and estimation that refer only to future
samples having exactly the same distribution of the N values of X. This is
known as a conditional test of significance, of which there are many examples
in statistics (e.g., all randomization tests, the Chi square test, Hotelling's
test of two correlated correlations, etc.).

(m) Also called the standard error of prediction or the standard deviation of
the errors of prediction.

* Editor's Note: I previously pointed out that transforming scores to their
unit deviate form would locate the vector defining those scores at the
origin of the space in which the vector is defined. Thus, the intercept,
$w_0$, is eliminated when the scores are transformed to unit deviate form.
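
The partition of the deviance can be checked numerically. The sketch below is an illustration in Python with hypothetical function names; it reuses the first arrangement from the Table 1 exercise (X = 0, 1, 2, 9 and Y = 0, 2, 5, 9), fits the least-squares line, and compares the three deviances.

```python
def deviance(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_line(xs, ys):
    """Least-squares intercept w0 and slope w1 for predicting y from x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    codev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w1 = codev / deviance(xs)
    return mean_y - w1 * mean_x, w1


xs = [0, 1, 2, 9]
ys = [0, 2, 5, 9]
w0, w1 = fit_line(xs, ys)
predicted = [w0 + w1 * x for x in xs]
errors = [y - p for y, p in zip(ys, predicted)]

dev_y = deviance(ys)
dev_pred = deviance(predicted)
dev_e = deviance(errors)
print("dev y:", dev_y, " dev predicted + dev e:", dev_pred + dev_e)
print("dev predicted / dev y:", dev_pred / dev_y)
```

The last printed ratio is the squared product-moment correlation for these data.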

The error variance can now be defined as:

(26)  $s_e^2 = \mathrm{dev}\,e / (N-2)$

or, since $\mathrm{dev}\,e = \mathrm{dev}\,y\,(1-r^2)$, as

(27)  $s_e^2 = \mathrm{dev}\,y\,(1-r^2)/(N-2)$

To point out the distinction between the criterion y and the predictor x, one
can write the error variance as

(28)  $s_e^2 = s_{y.x}^2$

where y.x is a convenient short-hand for the prediction error.

Fisher (1915) showed that the best estimate of the sampling variance of
$w_1$, under the FLINC assumptions, is:

(29)  $s_{y.x}^2 / \mathrm{dev}\,x$

(n) A useful outcome of equation 23 is that $r^2$ is equal to the deviance of
the predicted scores divided by the deviance of the observed scores:

(25)  $r^2 = \mathrm{dev}\,\hat{y} / \mathrm{dev}\,y$

Since the deviance of the observed scores is the total deviance, the $r^2$
value is the proportion of the Y deviance predicted by the best fitting
linear function of X and $(1-r^2)$ is the proportion of the Y deviance
that cannot be predicted from this function. One cannot refer to $r^2$
as the proportion of the variance of Y that can be linearly predicted
from X, because the divisor needed to obtain the variance is not
constant. To convert dev y into $s_y^2$, one must divide by (N-1); whereas
to convert dev e into $s_e^2$, one must divide by (N-2). Of course, for
large samples this discrepancy is of no practical importance.

In order to compare the sample slope, $w_1$, with an hypothesized slope,
$\mu_1$, Fisher recommended the ratio of the difference to the standard error:

(30)  $t = (w_1 - \mu_1)\sqrt{\mathrm{dev}\,x} / s_{y.x}$

When $\mu_1$ is hypothesized to be zero, the t-ratio for the slope is numerically
identical to the t-ratio for the product-moment correlation given in equation
19. Thus, when testing whether a least-squares slope is significantly
different from zero, the assumption that all future samples will have
exactly the same set of X values is not necessary. We can fall back on
the PLINC and LINC models, depending upon whether we are trying to infer
the linear functional relation or simply trying to do a prediction job
using the available measures, respectively.
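
The numerical identity of the two t-ratios is easy to confirm. The following Python sketch is an illustration only: the function names are hypothetical, and it assumes that equation 19 has the usual form t = r sqrt(N-2) / sqrt(1-r^2). The data are again the first arrangement of the Table 1 exercise.

```python
import math


def regression_summary(xs, ys):
    """Return N, dev x, slope w1, correlation r, and standard error of estimate."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    codev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w1 = codev / dev_x
    r = codev / math.sqrt(dev_x * dev_y)
    s_yx = math.sqrt(dev_y * (1.0 - r * r) / (n - 2))  # equation 27
    return n, dev_x, w1, r, s_yx


def slope_t(xs, ys, hypothesized_slope=0.0):
    """t-ratio for the slope, following equation 30."""
    n, dev_x, w1, r, s_yx = regression_summary(xs, ys)
    return (w1 - hypothesized_slope) * math.sqrt(dev_x) / s_yx


def correlation_t(xs, ys):
    """t-ratio for r (assumed form of equation 19)."""
    n, _, _, r, _ = regression_summary(xs, ys)
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


xs = [0, 1, 2, 9]
ys = [0, 2, 5, 9]
print(slope_t(xs, ys), correlation_t(xs, ys))
```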

W. S. Gosset. Fisher was not the first to work out the exact small-
sample distribution of a common statistic. This honor goes to W. S. Gosset,
a student of Karl Pearson, who worked as a chemist for the Guinness brewery
(and presumably was allowed only small samples of the brew!). Gosset
wrote under the name of Student to preserve commercial security.

In 1908 Student published a paper giving the exact small sample distribution
of the ratio:

(31)  $z = (\bar{X} - \mu)/s_x$

where $\bar{X}$ is the sample mean, μ is the population mean, and $s_x$ is the
sample standard deviation with (N-1) as the denominator.

Fisher did not know of Student's (1908) paper when he published his 1915
paper. Later, he gave full credit to Student and declared that one of the
chief purposes of Statistical Methods for Research Workers was to make the
work of Student appreciated and better known. Student was unable to prove
that the distribution of the standard deviation he was using was the correct
one, so later Fisher (1925) gave a rigorous proof of Student's results.
Fisher also invented the actual form of the t-ratio, since Student used the
z-ratio in his 1908 paper (equation 31) which does not generalize quite as
easily as the t-ratio.

Fisher's (1915) paper was received much like Student's (1908) paper --
with deafening apathy. Fisher spent the next four years teaching physics
and mathematics. About 1919 he was, almost simultaneously, offered a post
with K. Pearson and a position as a statistician at the Rothamsted
Agricultural Station. He chose Rothamsted. Judging from what we now know of
the two personalities involved, he would not have remained long with K.
Pearson.

[D]  The  Pitman  Permutation  Test 

As happens so frequently in statistics, the permutation test started
with the work of R. A. Fisher. Fisher (1925, section 21) was attempting
to answer the criticism that the Student t-test is seriously limited
because it requires that the observations be drawn from a normal
distribution. He asked whether a materially different result would be obtained
if one assumes that all observations are independently drawn from the same
population (i.e., the homerous assumption) without specifying a normal
distribution for this common population.

The example being considered was a test of the difference between
correlated means with 15 pairs of observations. If each (X,Y) pair was
truly drawn at random from the two series, then X-Y would have occurred
as frequently as Y-X. In all, some $2^{15}$ = 32,768 average differences could
be generated by such random combinations. The observed average difference
of 20.933 was exceeded, in the positive or negative direction from zero,
by 5.267% of the 32,768 chance combinations. The t-ratio was 2.148, which
gives a two-tailed level of 4.97% by the normal based table. Thus, Fisher
argued that the requirement of normality is not a serious limitation, at
least in this case, since almost the same significance level can be obtained
by just assuming independent data from the same population.
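
The counting behind Fisher's argument can be sketched as follows. This Python illustration is an assumption-laden sketch, not Fisher's own procedure: the function name is hypothetical, the differences in the example are made up (the memo does not list the 15 observed differences), and an arrangement is counted as extreme when its total is at least as large in absolute value as the observed total, so the observed arrangement counts itself.

```python
from itertools import product


def sign_flip_test(differences):
    """Two-tailed randomization test for paired differences.

    Every difference is given each sign in turn, so n pairs generate
    2**n equally likely totals under the null hypothesis.
    Returns (proportion at least as extreme as observed, number of totals).
    """
    observed = abs(sum(differences))
    n_total = 0
    n_extreme = 0
    for signs in product((1, -1), repeat=len(differences)):
        total = sum(s * d for s, d in zip(signs, differences))
        n_total += 1
        if abs(total) >= observed:
            n_extreme += 1
    return n_extreme / n_total, n_total


# Hypothetical paired differences (X - Y), for illustration only.
example = [5, -2, 8, 3, 6, -1, 4, 7, 2, 9, -3, 5, 6, 1, 4]
p_value, n_arrangements = sign_flip_test(example)
print(n_arrangements, p_value)
```

With 15 pairs the loop visits all 32,768 sign assignments, which is trivial on a modern machine.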

Other statisticians saw almost immediately that the randomization
or permutation method could be used to make a conditional test of the null
hypothesis for any statistic where the t-ratio or F-ratio would ordinarily
be used. In particular, E. J. G. Pitman (1937) applied the permutation
method to the product-moment correlation to test the null hypothesis that
ρ = 0.

In principle, the test is simple. We calculate the value of the
observed correlation in the usual way. Then we permute the order of the
Y values and recalculate the correlation. This procedure is repeated
until a product-moment correlation is obtained for each possible
permutation of the order of the Y values. For N pairs of observations, there
are N! permutations and thus N! product-moment correlations. These N!
correlations correspond to the sampling distribution of coefficients under
the randomization null hypothesis: that is, the sample was drawn from a
population in which the X,Y pairs were formed by chance. The N! correlations
are arranged in order of magnitude and the percent of coefficients greater
than the observed product-moment correlation is calculated. When the
percentage of permutation coefficients exceeding the observed product-moment
correlation is less than 5%, one can reject the null hypothesis at the 5% level.*

Table 1

Exercise on Randomization Test

   Y       a      b      c      d      e      f      g      h

   0       0      0      0      0      0      0      1      1
   2       1      1      2      2      9      9      0      0
   5       2      9      1      9      1      2      2      9
   9       9      2      9      1      2      1      9      2

 r_xy    .936   .355   .876   .209  -.146                 .313
 w_yx           .34    .84          -.14   -.22    .86
 w_0           2.98   1.48   3.40   4.42   4.66   1.42   3.10


   Y       i      j      k      l      m      n      o      p

   0       1      1      1      1      2      2      2      2
   2       2      2      9      9      0      0      1      1
   5       0      9      0      2      1      9      0      9
   9       9      0      2      0      9      1      9      0

 r_xy    .772
 w_yx    .74    .02   -.24   -.40    .76    .12    .70   -.02
 w_0    1.78   3.94   4.72   5.20   1.72   3.64   1.90   4.06


   Y       q      r      s      t      u      v      w      x

   0       2      2      9      9      9      9      9      9
   2       9      9      0      0      1      1      2      2
   5       0      1      1      2      0      2      0      1
   9       1      0      2      1      2      0      1      0

 r_xy   -.438  -.521  -.521  -.605  -.584  -.751  -.730  -.813
 w_yx   -.42   -.50   -.50   -.58   -.56   -.72   -.70   -.78
 w_0    5.26   5.50   5.50   5.74   5.68   6.16   6.10   6.34

* Editor's Note: Ardie handed out an exercise on the randomization test
for the product-moment correlation. This exercise is included as Table 1
as an illustration for the reader.
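
The exercise in Table 1 can be worked by brute force. The sketch below is an illustration in Python with hypothetical function names; it takes the four Y scores (0, 2, 5, 9) and the four X scores (0, 1, 2, 9) from Table 1, treats arrangement a as the observed sample, and enumerates all 4! = 24 pairings. An arrangement is counted as extreme when its correlation is at least as large as the observed one, so the observed arrangement counts itself.

```python
import math
from itertools import permutations


def pearson_r(xs, ys):
    """Product-moment correlation from codeviance and deviances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    codev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return codev / math.sqrt(dev_x * dev_y)


def permutation_test(xs, ys):
    """Return observed r, all N! permutation correlations, one-tailed proportion."""
    observed = pearson_r(xs, ys)
    all_r = [pearson_r(xs, perm) for perm in permutations(ys)]
    # A small tolerance guards against floating-point ties with the observed r.
    n_extreme = sum(1 for r in all_r if r >= observed - 1e-12)
    return observed, all_r, n_extreme / len(all_r)


xs = (0, 1, 2, 9)
ys = (0, 2, 5, 9)
observed, all_r, proportion = permutation_test(xs, ys)
print(f"observed r = {observed:.3f}")
print(f"{len(all_r)} permutations, proportion at least as large = {proportion:.4f}")
```

Permuting Y against a fixed X, as here, yields the same set of 24 correlations as permuting X against a fixed Y, which is how the table is laid out.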
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Note  the  highly  conditional  nature  of  the  permutation  test.  Strictly 
speaking,  the  significance  of  the  observed  correlation  is  being  judged  only 
with  reference  to  a population  having  marginal  distributions  of  X and  Y 
that  are  identical  to  the  sample.  Thus,  any  inference  about  the  value  of 
the  population  correlation  holds  only  for  future  samples  with  exactly  those 
marginal  distributions  of  X and  Y.  Of  course,  few  researchers  are  very 
strict  about  their  inferences  when  they  publish  such  results,  but  it  is 
salutary  to  realize  that  the  marginal  distributions  of  X and  Y for  the 
drawn  sample  may  not  represent  your  population.  If  so,  then  your  inference 
about  the  population  correlation  may  be  sadly  askew. 

Pitman was able to prove some generalizations about the moments of
the permutation distribution of correlations under the null hypothesis.
Perhaps his most important conclusion was that as N increases indefinitely,
the distribution of the permutation correlations tends toward the distribution
of the bivariate normal product-moment r. Thus, the Pitman permutation test
for a large N is identical to the Fisher t test for the null hypothesis that
ρ = 0 (i.e., equation 19).

The Pitman results are worth presenting in some detail. The average
of all N! permutation correlations will always be exactly zero, and the
variance of the N! coefficients about zero is exactly equal to the inverse
of (N-1). The third moment of the distribution is:

(32)  $\sum r^3 / N! = (N-2)\,\gamma_x \gamma_y / [N(N-1)^2]$

where $\gamma_x$ and $\gamma_y$ refer to the skewness of the X and Y
distributions, respectively. The general equation for skewness is:

(33)  $\gamma_x = E[(x_i/\sigma_x)^3]$

where E denotes expectation over all possible values of X, and $x_i$ represents
the deviation from the population mean of X. Thus, $\gamma_x$ will be positive
if there is a long positive tail; negative if there is a long negative tail;
and zero if the distribution is symmetric about the mean, as in the case of
a normal distribution.

Equation 32 implies that if both X and Y have skewed distributions, then
some extremely high values of $r^2$ will occur in the permutation distribution.
If both variables are skewed either negatively or positively, then the
distribution of permutation correlations will have some extremely large
positive correlations when the skewed observations from X and Y happen to
coincide. When one variable has a positive skew and the other a negative
skew, then the distribution will contain some correlations close to -1.
When one of the variables has no skew, then the problem of bivariate skew
vanishes and the distribution of the permutation correlations will have no
skew. Finally, the most important implication of equation 32 is that the
skew of the permutation correlation distribution must approach zero as N
increases.
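
Pitman's statements about the first three moments can be checked by enumeration for a small sample. The Python sketch below is an illustration with hypothetical function names; it uses the Table 1 scores (both positively skewed) and then replaces X by the symmetric scores 1, 2, 3, 4, which are an assumption made only for the comparison.

```python
import math
from itertools import permutations


def pearson_r(xs, ys):
    """Product-moment correlation from codeviance and deviances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    codev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return codev / math.sqrt(dev_x * dev_y)


def permutation_moments(xs, ys):
    """Mean, second moment about zero, and third moment of the N! correlations."""
    rs = [pearson_r(xs, perm) for perm in permutations(ys)]
    count = len(rs)
    mean = sum(rs) / count
    second = sum(r ** 2 for r in rs) / count
    third = sum(r ** 3 for r in rs) / count
    return mean, second, third


skewed = permutation_moments((0, 1, 2, 9), (0, 2, 5, 9))
symmetric_x = permutation_moments((1, 2, 3, 4), (0, 2, 5, 9))
print("both skewed:  ", skewed)
print("X symmetric:  ", symmetric_x)
```

For N = 4 the second moment should equal 1/3 in both runs, and the third moment should vanish in the second run because one variable has no skew.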

The standardized fourth moment (i.e., the kurtosis) of the permutation
correlation distribution will equal 3/[(N-1)(N+1)] if either X or Y has a
kurtosis of zero. Moreover, as N increases the kurtosis of the permutation
correlation distribution approaches 3/[(N-1)(N+1)], regardless of the
kurtosis of either X or Y.

The importance of all this is that Fisher's test of significance for
the product-moment correlation (equation 19) can be used. Thus, using only
the assumptions of independence of the N pairs and mutual linearity of X and Y,
we are able to use the same test of significance as the bivariate normal
model with the LINC assumptions.

What is the moral of the Pitman story? First, if you simply concentrate
on one question, whether the null hypothesis is true in the population,
a conditional test of significance is always possible with very few
assumptions. Second, the large sample version of the permutation test of
significance may (and generally does) coincide with the exact test of
significance used when all the bivariate normal assumptions hold in the
population.

There are two disadvantages to the Pitman permutation test. First,
the test is conditional and thus limited to the population where the marginal
distributions of X and Y are identical to those observed in the sample.
Second, by the time the sample size is large enough for one to expect a
stable correlation (i.e., N=20), the number of permutation correlations
required to obtain the distribution has soared to the millions. The only
saving grace here is that if either X or Y has a symmetric distribution
(i.e., a skew of zero) and is very flat (i.e., the kurtosis is near zero),
then the usual t-test is valid for small samples without the necessity of
constructing the entire permutation correlation distribution.


[E]  The  Spearman  Correlation  for  Ranked  Observations 

C. Spearman, in his article (1904) raising the issue of errors of
measurement and their attenuation effect on the product-moment correlation,
also suggested that each variable be ranked from 1 to N and a product-moment
correlation calculated for the ranks. Pearson (1907) responded favorably
to this latter suggestion and worked extensively on the characteristics
of rank correlations.

One immediate question was how to determine the significance of the
Spearman rank correlation. First, when ρ = 0, the variance of the sample
value is 1/(N-1). This is identical to Pitman's (1937) result for the
permutation product-moment r distribution. However, like the product-
moment correlation, the Spearman rank correlation has distinctly non-normal
distributions for small samples or when ρ is near ±1. Thus, the significance
of an observed rank correlation (hereafter designated sr) cannot be tested
by dividing it by its variance.

In 1936, H. Hotelling and M. Pabst used the permutation method of
generating the sampling distribution of the sr in the null case that ρ = 0.
They managed to work out the exact distribution of sr for N=2 to N=7. For
N=7 they had to calculate 7! = 5040 values with no electronic computer to
help them, so they stopped there. However, Kendall, Kendall, and Babington
(1938) soon after computed the 8! = 40,320 values of sr for N=8; and David,
Kendall, and Stuart (1951) later published the exact probability levels
for N=9 and N=10. As far as I know, no one has worked out the 11! = 39,916,800
Spearman rank correlations for N=11, but there is no practical need to do so.

The Fisher t-test for the significance of a bivariate normal product-moment
correlation will give almost exactly the correct levels of significance for
sr with N greater than 10. In other words, all that is necessary is to
substitute sr for r in equation 19.
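
A minimal sketch of the procedure in Python follows. It is an illustration under stated assumptions: the function names are hypothetical, the ranking routine assumes no tied scores, equation 19 is assumed to have the usual form t = r sqrt(N-2) / sqrt(1-r^2), and the twelve pairs of scores are made up for the example.

```python
import math


def ranks(values):
    """Ranks from 1 to N; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson_r(xs, ys):
    """Product-moment correlation from codeviance and deviances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    codev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return codev / math.sqrt(dev_x * dev_y)


def spearman_sr(xs, ys):
    """Spearman rank correlation: the product-moment r of the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


def t_ratio(r, n):
    """t-ratio with N-2 degrees of freedom (assumed form of equation 19)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Hypothetical scores for 12 subjects, for illustration only.
x = [3, 11, 7, 25, 14, 2, 40, 9, 18, 5, 31, 22]
y = [10, 19, 12, 48, 30, 8, 95, 21, 26, 15, 60, 33]
sr = spearman_sr(x, y)
print(f"sr = {sr:.3f}, t = {t_ratio(sr, len(x)):.3f} on {len(x) - 2} df")
```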

(o) Incidentally, Spearman and Pearson spent much of their working lives in close
physical proximity at University College, London. Spearman was head of the
Psychology Department and Pearson was head of the Biometrics and Eugenics
Laboratories, so the two men were separated only by a courtyard. Apparently,
they had no particular liking for one another. Their disputes stimulated the
development of some important concepts and led Pearson to exhaustive
mathematical investigations for which Spearman had no training.

(p) The use of the permutation method was apparently quite independent of
Fisher or Pitman.

Table 2 illustrates the degree of approximation involved. Here I have
listed the .05 one-tail values of the Pearson product-moment correlation
based on the PLINC assumptions of the bivariate-normal model and the
transformation to the Fisher t-ratio. Since the Spearman rank correlation
can only take on a finite set of discrete values, most of the sr values
given in Table 2 are fiction in the sense that no such sr values could
ever result for the given sample size of N. The sr values were obtained
by linear interpolation from the exact probabilities and tend to be a bit
too high.


Table 2

Values of the Spearman Rank-Order and Pearson
Product-Moment Correlations at the .05 One-Tailed
Level of Significance

  N    df = (N-2)    r.05    sr.05

  4        2         .900    .987
  5        3         .805    .908
  6        4         .729    .774
  7        5         .669    .695
  8        6         .621    .637
  9        7         .582    .583
 10        8         .549    .546

Table 2 shows that as N increases, the two critical .05 levels approach
one another, until, for N greater than 10, they differ only in the third
decimal place. Practically all sample sizes above 10 are large samples
when testing sr for significance at the .05 one-tail level.
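
The Pearson column of Table 2 can be reproduced by inverting the t-ratio. The Python sketch below is an illustration under assumptions: it takes equation 19 to have the usual form t = r sqrt(df) / sqrt(1-r^2), which gives r = t / sqrt(t^2 + df), and it uses the standard one-tailed .05 critical values of t as printed in ordinary t tables. The names are hypothetical.

```python
import math

# One-tailed .05 critical values of t, keyed by degrees of freedom
# (standard t-table values, rounded to three decimals).
T_CRIT_05_ONE_TAIL = {2: 2.920, 3: 2.353, 4: 2.132, 5: 2.015,
                      6: 1.943, 7: 1.895, 8: 1.860}


def critical_r(t_crit: float, df: int) -> float:
    """Correlation whose t-ratio on df degrees of freedom equals t_crit."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


for df, t_crit in sorted(T_CRIT_05_ONE_TAIL.items()):
    print(f"N = {df + 2:2d}   df = {df}   critical r = {critical_r(t_crit, df):.3f}")
```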

Hotelling, Pabst, and Kendall generated the sampling distribution of the
Spearman rank correlation under the null hypothesis exactly in the way called
for by the Pitman permutation method. Therefore, the moments of the sampling
distribution of sr, under the null hypothesis, can be deduced from the general
equations given by Pitman. The average of all N! rank correlations will be
zero and the variance will be 1/(N-1). For the Spearman rank correlation
both X and Y have been transformed to rectangular distributions, which are
symmetric about (N+1)/2. Thus, the skew coefficients for X and Y are zero
and so is the skew of the sampling distribution of the Spearman rank
correlation under the null hypothesis. The fourth moment is given by:

(34)  $3K/(N^2-1)$

where $K = 12(N+6)/[25N(N-1)]$. As N increases, K goes to zero.

The sampling distribution of sr is identical to the Pearson Type II
symmetrical distribution, except for the fourth moment which becomes similar
as N increases. Fisher (1915) proved that the exact distribution of the
product-moment correlation, under the LINC assumptions with ρ = 0, is the
Pearson Type II symmetrical distribution. Therefore, to the extent that
the moments of the Spearman rank correlation sampling distribution approach
those of the Pearson Type II distribution, the exact Fisher t-test
(equation 19) will be a good approximate test for the significance of sr.

I view the Spearman rank correlation as an excellent alternative to
the product-moment correlation when the question is whether there is a
significant association between X and Y. When you have drawn the subjects
independently and assured yourself of a mutual linear regression, then
transforming X and Y into ranks gets rid of the assumptions of marginal
normality and homoscedasticity.

However, rank correlation is not an answer to the prediction of the
observed Y scores from the observed X scores with the least error. For the
same number of subjects, sr will have less statistical power (i.e., the
probability that the experimenter will accept the alternative hypothesis
when it is true) than r. Hotelling and Pabst (1936) found that for large
samples r attained the same power as sr with 91% of the sample size. Thus, the
asymptotic efficiency of sr is 91% of that of r when the bivariate normal
assumptions are met. Of course, sr may be more efficient than r when the
bivariate normal assumptions are not met. In fact, sr is the most efficient
measure of correlation when the marginal distributions of X and Y are logistic
functions.

ADDENDUM: DIAGNOSTIC CHECKS

In my opinion, the right to compute any statistic you wish, even a
product-moment correlation, is guaranteed by the first Article of the Bill
of Rights just as much as is freedom of speech or religion. But just as
freedom of speech is not an adequate response to the charge of libel or
publishing false and misleading commercial information, just so the right
to compute a statistic is not the right to publish false or misleading
information. Just as the judges and the courts stand guard over the rights
of citizens and try to prevent abuse of freedom of speech, just so editors
and referees stand guard over the rights of the reader and try to prevent
publication of statistics that misrepresent the data, contain spurious
elements, or are just plain false.

Ordinarily then, the question of whether a statistic should be computed
is not an ethical question for me. If you think it would yield information
of interest to you, then of course the answer is 'Hell, yes, compute it.'
But there is an ethical violation, for me, when someone attempts to publish
a statistic before (1) checking for numerical accuracy, (2) checking that
essential parts of the model do fit the data, (3) checking that artifact
or spurious elements have not distorted the numerical value of the statistic,
(4) making sure that the published level of significance has not been
distorted by multiple comparisons, etc.

Moral standards change from time to time in statistics as well as in
social fields. At present, very few would agree that diagnostic checks,
of how well the model fits the data, are just as much a responsibility of
the research worker as numerical checks of accuracy. To my surprise, some
editors apparently don't even agree on the necessity for numerical checks
of accuracy. Inevitably, one can point to wildly inaccurate statistics
being published in such journals. I would hope in the near future, that
editors will insist on statements from the authors regarding such checks
just as they now insist on statements regarding the humane treatment of
subjects.

Ardie recommended the following steps as diagnostic checks for the
linear regression model for the one-predictor case.

(1) Test for linearity:

Compute  $F_{1,\,N-3} = (r_q^2 - r_{xy}^2) / [(1 - r_q^2)/(N-3)]$

where $r_{xy}^2$ is the squared Pearson product-moment correlation and
$r_q^2$ is the squared quadratic correlation coefficient,

or

Compute  $F = (CR - r^2) / \ldots$

where CR is the correlation ratio and $r^2$ is the squared Pearson
product-moment correlation.

If either of these is nonsignificant, go to step 4. Otherwise, go
to step 2.

(2) Test for quadratic model:

Compute: F

If significant, a nonquadratic, nonlinear model describes the data. If
nonsignificant, the quadratic model is sufficient.

(3) Transform the data to obtain linearity and repeat steps 1 and 2.

(4) Test for skewness:

Compare the Pearson product-moment correlation with the
Spearman rank order correlation. These results should agree with step 1.

(5) Plot the data. These results should agree with those obtained
in steps 1 and 2.
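
The first form of the linearity test in step 1 might be sketched as follows. This Python illustration rests on assumptions not stated in the text: the quadratic correlation is taken to be the multiple correlation of Y on X and X squared, the F-ratio is taken to have 1 and N-3 degrees of freedom, and the function names and both data sets are made up for the example. The gain from the squared term is obtained by correlating the two sets of residuals left after the linear trend on X has been removed.

```python
def deviates(values):
    """Return scores as deviations from their mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def linearity_f(xs, ys):
    """Return (F, r^2 for the line, R^2 for the quadratic)."""
    n = len(xs)
    x = deviates(xs)
    y = deviates(ys)
    q = deviates([v * v for v in xs])  # the squared predictor
    dev_x = sum(v * v for v in x)
    dev_y = sum(v * v for v in y)
    slope_y = sum(a * b for a, b in zip(x, y)) / dev_x
    slope_q = sum(a * b for a, b in zip(x, q)) / dev_x
    # Residuals of Y and of X squared after removing the linear trend on X.
    resid_y = [b - slope_y * a for a, b in zip(x, y)]
    resid_q = [b - slope_q * a for a, b in zip(x, q)]
    sse_linear = sum(e * e for e in resid_y)
    cross = sum(a * b for a, b in zip(resid_y, resid_q))
    gain = cross ** 2 / sum(e * e for e in resid_q)
    r2_linear = 1.0 - sse_linear / dev_y
    r2_quadratic = r2_linear + gain / dev_y
    f_ratio = (r2_quadratic - r2_linear) / ((1.0 - r2_quadratic) / (n - 3))
    return f_ratio, r2_linear, r2_quadratic


# Hypothetical data: one roughly linear set and one clearly curved set.
x = [1, 2, 3, 4, 5, 6]
roughly_linear = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
clearly_curved = [1.2, 3.9, 9.1, 15.8, 25.3, 35.9]
print(linearity_f(x, roughly_linear))
print(linearity_f(x, clearly_curved))
```

A small F for the first set and a large F for the second is the pattern that sends one to step 4 or to step 2, respectively; the critical value would be read from an F table.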
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